Deep Affordance-grounded Sensorimotor Object Recognition
It is well-established by cognitive neuroscience that human perception of
objects constitutes a complex process, where object appearance information is
combined with evidence about the so-called object "affordances", namely the
types of actions that humans typically perform when interacting with them. This
fact has recently motivated the "sensorimotor" approach to the challenging task
of automatic object recognition, where both information sources are fused to
improve robustness. In this work, the aforementioned paradigm is adopted,
surpassing current limitations of sensorimotor object recognition research.
Specifically, the deep learning paradigm is introduced to the problem for the
first time, developing a number of novel neuro-biologically and
neuro-physiologically inspired architectures that utilize state-of-the-art
neural networks for fusing the available information sources in multiple ways.
The proposed methods are evaluated using a large RGB-D corpus, which is
specifically collected for the task of sensorimotor object recognition and is
made publicly available. Experimental results demonstrate the utility of
affordance information for object recognition: including it yields up to a 29%
relative error reduction.
Comment: 9 pages, 7 figures, dataset link included, accepted to CVPR 201
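As a reading aid for the "relative error reduction" figure quoted above, the following minimal sketch computes that quantity from two error rates. The function name and the example error rates are hypothetical and are not taken from the paper; only the definition (error removed, divided by the baseline error) is standard.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the fraction of the baseline error removed by the new method."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates, chosen only for illustration (not results from the paper).
appearance_only_error = 0.20
with_affordance_error = 0.15
reduction = relative_error_reduction(appearance_only_error, with_affordance_error)
print(f"relative error reduction: {reduction:.1%}")
```

A relative reduction is always measured against the baseline error, so the same absolute gain counts for more when the baseline error is already small.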
A Deep Learning Approach to Object Affordance Segmentation
Learning to understand and infer object functionalities is an important step
towards robust visual intelligence. Significant research efforts have recently
focused on segmenting the object parts that enable specific types of
human-object interaction, the so-called "object affordances". However, most
works treat it as a static semantic segmentation problem, focusing solely on
object appearance and relying on strong supervision and object detection. In
this paper, we propose a novel approach that exploits the spatio-temporal
nature of human-object interaction for affordance segmentation. In particular,
we design an autoencoder that is trained using ground-truth labels of only the
last frame of the sequence, and is able to infer pixel-wise affordance labels
in both videos and static images. Our model removes the need for object
labels and bounding boxes by using a soft-attention mechanism that enables the
implicit localization of the interaction hotspot. For evaluation purposes, we
introduce the SOR3D-AFF corpus, which consists of human-object interaction
sequences and supports 9 types of affordances in terms of pixel-wise
annotation, covering typical manipulations of tool-like objects. We show that
our model achieves competitive results compared to strongly supervised methods
on SOR3D-AFF, while being able to predict affordances for similar unseen
objects in two affordance image-only datasets.
Comment: 5 pages, 4 figures, ICASSP 202
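The soft-attention idea mentioned in this abstract can be illustrated generically: per-location scores are turned into weights with a softmax, and the weights are used to pool per-location features. The sketch below is an assumption-laden toy in plain Python, not the paper's architecture; the function names, feature vectors, and scores are all hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention_pool(features: List[List[float]], scores: List[float]) -> List[float]:
    """Weighted average of per-location feature vectors, weighted by softmax(scores)."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Three hypothetical spatial locations with 2-D features; the third scores highest,
# so it dominates the pooled vector (a stand-in for an "interaction hotspot").
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.1, 0.2, 3.0]
pooled = soft_attention_pool(features, scores)
print(pooled)
```

Because the weights are differentiable in the scores, a network can learn where to attend from the segmentation loss alone, which is one way localization can emerge without bounding boxes.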
Semi-supervised Meta-learning with Disentanglement for Domain-generalised Medical Image Segmentation
Generalising deep models to new data from new centres (termed here domains)
remains a challenge. This is largely attributed to shifts in data statistics
(domain shifts) between source and unseen domains. Recently, gradient-based
meta-learning approaches where the training data are split into meta-train and
meta-test sets to simulate and handle the domain shifts during training have
shown improved generalisation performance. However, the current fully
supervised meta-learning approaches are not scalable for medical image
segmentation, where large effort is required to create pixel-wise annotations.
Meanwhile, in a low data regime, the simulated domain shifts may not
approximate the true domain shifts well across source and unseen domains. To
address this problem, we propose a novel semi-supervised meta-learning
framework with disentanglement. We explicitly model the representations related
to domain shifts. Disentangling the representations and combining them to
reconstruct the input image allows unlabeled data to be used to better
approximate the true domain shifts for meta-learning. Hence, the model can
achieve better generalisation performance, especially when there is a limited
amount of labeled data. Experiments show that the proposed method is robust on
different segmentation tasks and achieves state-of-the-art generalisation
performance on two public benchmarks.
Comment: Accepted by MICCAI 202
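The meta-train/meta-test split described in this abstract can be sketched as follows. This assumes the split is made at the level of whole source domains and is redrawn at each training step, which is a common choice in gradient-based meta-learning but is not confirmed by the abstract; the function name and the domain names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_domains(
    domains: Sequence[str], n_meta_test: int, rng: random.Random
) -> Tuple[List[str], List[str]]:
    """Randomly hold out n_meta_test source domains to simulate a domain shift."""
    if not 0 < n_meta_test < len(domains):
        raise ValueError("n_meta_test must be between 1 and len(domains) - 1")
    shuffled = list(domains)
    rng.shuffle(shuffled)
    meta_test = shuffled[:n_meta_test]
    meta_train = shuffled[n_meta_test:]
    return meta_train, meta_test


# Hypothetical source domains (e.g. imaging centres); a new split is drawn per step.
rng = random.Random(0)
source_domains = ["centre_A", "centre_B", "centre_C"]
for step in range(3):
    meta_train, meta_test = split_domains(source_domains, 1, rng)
    print(step, meta_train, meta_test)
```

In a full training loop, the model would be updated on the meta-train domains and then evaluated on the held-out meta-test domains, so that the update is encouraged to transfer across the simulated shift.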
Noise-in, Bias-out: Balanced and Real-time MoCap Solving
Real-time optical Motion Capture (MoCap) systems have not benefited from the
advances in modern data-driven modeling. In this work we apply machine learning
to solve noisy unstructured marker estimates in real-time and deliver robust
marker-based MoCap even when using sparse affordable sensors. To achieve this
we focus on a number of challenges related to model training, namely the
sourcing of training data and their long-tailed distribution. Leveraging
representation learning we design a technique for imbalanced regression that
requires no additional data or labels and improves the performance of our model
in rare and challenging poses. By relying on a unified representation, we show
that training such a model is not bound to high-end MoCap training data
acquisition, and exploit the advances in marker-less MoCap to acquire the
necessary data. Finally, we take a step towards richer and affordable MoCap by
adapting a body model-based inverse kinematics solution to account for
measurement and inference uncertainty, further improving performance and
robustness. Project page: https://moverseai.github.io/noise-tail
Learning Disentangled Representations in the Imaging Domain
Disentangled representation learning has been proposed as an approach to
learning general representations even in the absence of, or with limited,
supervision. A good general representation can be fine-tuned for new target
tasks using modest amounts of data, or used directly in unseen domains
achieving remarkable performance in the corresponding task. This alleviation of
the data and annotation requirements offers tantalising prospects for
applications in computer vision and healthcare. In this tutorial paper, we
motivate the need for disentangled representations, present key theory, and
detail practical building blocks and criteria for learning such
representations. We discuss applications in medical imaging and computer vision
emphasising choices made in exemplar key works. We conclude by presenting
remaining challenges and opportunities.
Comment: Submitted. This paper follows a tutorial style but also surveys a
considerable (more than 200 citations) number of work
Compositionally Equivariant Representation Learning
Deep learning models often need sufficient supervision (i.e. labelled data)
in order to be trained effectively. By contrast, humans can swiftly learn to
identify important anatomy in medical images like MRI and CT scans, with
minimal guidance. This recognition capability easily generalises to new images
from different medical facilities and to new tasks in different settings. This
rapid and generalisable learning ability is largely due to the compositional
structure of image patterns in the human brain, which are not well represented
in current medical models. In this paper, we study the utilisation of
compositionality in learning more interpretable and generalisable
representations for medical image segmentation. Overall, we propose that the
underlying generative factors that are used to generate the medical images
satisfy a compositional equivariance property, where each factor is compositional
(e.g. corresponds to the structures in human anatomy) and also equivariant to
the task. Hence, a good representation that closely approximates the ground-truth
factors has to be compositionally equivariant. By modelling the compositional
representations with learnable von-Mises-Fisher (vMF) kernels, we explore how
different design and learning biases can be used to enforce the representations
to be more compositionally equivariant under un-, weakly-, and semi-supervised
settings. Extensive results show that our methods achieve the best performance
over several strong baselines on the task of semi-supervised domain-generalised
medical image segmentation. Code will be made publicly available upon
acceptance at https://github.com/vios-s.
Comment: Submitted. 10 pages. arXiv admin note: text overlap with
arXiv:2206.1453
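For readers unfamiliar with von-Mises-Fisher (vMF) kernels, the sketch below shows one generic way a feature vector can be softly assigned to a set of vMF kernels. It assumes unit-normalised vectors and a single concentration parameter kappa shared by all kernels, so that the vMF normalising constant cancels in the softmax. The function names, kernel values, and kappa are hypothetical, and this is not the authors' implementation.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalise the zero vector")
    return [x / norm for x in v]


def vmf_assignments(z: List[float], kernels: List[List[float]], kappa: float) -> List[float]:
    """Soft assignment of z to each kernel mean mu, proportional to exp(kappa * mu . z)."""
    z_unit = normalize(z)
    logits = [
        kappa * sum(a * b for a, b in zip(normalize(mu), z_unit)) for mu in kernels
    ]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical kernel means; the feature points mostly along the first one.
kernels = [[1.0, 0.0], [0.0, 1.0]]
weights = vmf_assignments([0.9, 0.1], kernels, kappa=10.0)
print(weights)
```

A larger kappa makes the assignment sharper, so each feature is explained by fewer kernels; a smaller kappa spreads the assignment more evenly.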